In this quiz, I had to recognize which statistical model to use for each question. The models showcased are simple linear regression, multiple linear regression, and one-way ANOVA.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pander)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
##
## The following object is masked from 'package:pander':
##
## wrap
library(broom)
coag <- read.csv("coagulation.csv", header = TRUE)
infmort <- read.csv("infmort.csv", header = TRUE)
pima<- read.csv("pima.csv", header = TRUE)
prostate<- read.csv("prostate.csv", header = TRUE)
#lm, mlreg, anova
#Hypothesis, EDA, hypothesis testing, covariance, normality, fit tests, visualization
Question 1: 1. NAs removed; see below. 2. Linear regression. Null hypothesis: there is no statistically significant association between family history (diabetes) and diabetes status (test). Alternative hypothesis: there is a statistically significant association between family history (diabetes) and diabetes status (test). Significance will be set to 0.05, the standard alpha level.
Results: p-value = 1.255e-06 (model F-test). There is a statistically significant association between family history (diabetes) and diabetes status (test); reject the null. Adjusted R-squared = 0.029, so the model explains little of the observed variance. Both coefficients are significant. Residual median = -0.0967; residual standard error = 0.327; F-statistic = 23.87.
Conclusion: there is a link between family history and currently having diabetes. 3. I kept test in the regression model so its share of the explained variance is accounted for. Variables associated with diabetes: test, triceps, & insulin. 4. I kept test in the regression model so its share of the explained variance is accounted for. Variables associated with test: diabetes, pregnant, glucose, diastolic, & bmi. This makes sense: if you have diabetes, closely linked variables such as glucose move with it.
#1 Remove NAs
pima[, 2:6][pima[, 2:6]==0] = NA
head(pima)
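Since the zeros in columns 2-6 were recoded to NA, it may help to confirm how many complete cases remain; lm() silently drops incomplete rows, which is why the multiple-regression tables later report 392 observations. A minimal sketch, assuming the pima data frame above:

```r
# How many rows survive listwise deletion, and where the NAs landed;
# lm() uses na.action = na.omit by default, so these rows are dropped.
sum(complete.cases(pima))   # rows with no missing values
colSums(is.na(pima))        # NAs introduced per column by the recoding
```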
#2
#Run the linear regression first so the assumption functions below make sense
LinearReg <- lm(diabetes ~ factor(test), data = pima)
summary(LinearReg)
##
## Call:
## lm(formula = diabetes ~ factor(test), data = pima)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46250 -0.22556 -0.09673 0.15227 1.89927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.42973 0.01460 29.431 < 2e-16 ***
## factor(test)1 0.12077 0.02472 4.886 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3265 on 766 degrees of freedom
## Multiple R-squared: 0.03022, Adjusted R-squared: 0.02896
## F-statistic: 23.87 on 1 and 766 DF, p-value: 1.255e-06
#Dummy variable: test is stored as an integer but is categorical, hence the dummy transformation
pimaDum <- mutate(pima, test.dum = factor(test))
LinearReg_Dum <- lm(diabetes~test.dum, pimaDum) #final linear regression, but to work with the assumptions below I put it here
summary(LinearReg_Dum)
##
## Call:
## lm(formula = diabetes ~ test.dum, data = pimaDum)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46250 -0.22556 -0.09673 0.15227 1.89927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.42973 0.01460 29.431 < 2e-16 ***
## test.dum1 0.12077 0.02472 4.886 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3265 on 766 degrees of freedom
## Multiple R-squared: 0.03022, Adjusted R-squared: 0.02896
## F-statistic: 23.87 on 1 and 766 DF, p-value: 1.255e-06
#EDA
#Preliminary EDA
#Summaries
dim(pima)
## [1] 768 9
str(pima)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant : int 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ diastolic: int 72 66 64 66 40 74 50 NA 70 96 ...
## $ triceps : int 35 29 NA 23 35 NA 32 NA 45 NA ...
## $ insulin : int NA NA NA 94 168 NA 88 NA 543 NA ...
## $ bmi : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
## $ diabetes : num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ test : int 1 0 1 0 1 0 1 0 1 1 ...
summary(pima)
## pregnant glucose diastolic triceps
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.41 Mean :29.15
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## NA's :5 NA's :35 NA's :227
## insulin bmi diabetes age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 76.25 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :125.00 Median :32.30 Median :0.3725 Median :29.00
## Mean :155.55 Mean :32.46 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.00 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## NA's :374 NA's :11
## test
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
##
pima %>%
group_by(factor(test)) %>%
summarise(n = n(), mean = mean(diabetes), sd = sd(diabetes),
median = median(diabetes), IQR = IQR(diabetes)) %>% pander
| factor(test) | n | mean | sd | median | IQR |
|---|---|---|---|---|---|
| 0 | 500 | 0.4297 | 0.2991 | 0.336 | 0.332 |
| 1 | 268 | 0.5505 | 0.3724 | 0.449 | 0.4655 |
#Plots
ggplot(pima, aes(test))+
geom_bar()
ggplot(pima, aes(diabetes))+
geom_histogram()+
geom_vline(aes(xintercept = mean(diabetes)), color = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(pima, aes(diabetes))+
geom_boxplot() #a lot of outliers
ggplot(pima, aes(x= factor(test), y= diabetes))+
geom_boxplot()
#Assumptions
#1.Independent observations = The family history and having diabetes of a participant is not dependent on another participant's history or diabetes presence.
#2.Linearity = Not linear
par(mfrow=c(2,2))
plot(LinearReg) #Residual vs Fitted plot is linear, but only between two points
crPlots(LinearReg) # Most of the variables' residuals are not linear
#3 Homoscedasticity/ constant variability = Not homoscedastic.
par(mfrow=c(2,2))
plot(LinearReg)#See plot(LinearReg)'s Scale-Location plot. The line is not horizontal and runs only between two points.
#4 Normality of Residuals = Residuals are not normal. Need to log(diabetes).
#See plot(LinearReg)'s Normal Q-Q plot OR the following:
pima.data <- augment(LinearReg_Dum)# get the residuals
ggplot(pima.data, aes(sample = .std.resid)) +
geom_qq() +
stat_qq_line(color = "green") #Doesn't follow a straight line, not normal at all
ggplot(pima.data, aes(x = .resid)) +
geom_histogram(bins = 35) #Normal enough, maybe a little platykurtic
#Log function
LinearReg1 <- lm(log(diabetes)~ test.dum, data = pimaDum)
summary(LinearReg1)
##
## Call:
## lm(formula = log(diabetes) ~ test.dum, data = pimaDum)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.62933 -0.47994 -0.03252 0.47709 1.89052
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.04508 0.02836 -36.851 < 2e-16 ***
## test.dum1 0.24399 0.04801 5.082 4.69e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6341 on 766 degrees of freedom
## Multiple R-squared: 0.03262, Adjusted R-squared: 0.03136
## F-statistic: 25.83 on 1 and 766 DF, p-value: 4.689e-07
crPlots(LinearReg1)
par(mfrow=c(2,2))
plot(LinearReg1)#A lot more normal now
(length(boxplot(pima$diabetes, plot=F)$out)/768)*100 #I could remove outliers here, but I've decided against it since there are so many of them (3.8% of data are outliers). That seems like a lot to cut out.
## [1] 3.776042
#3 Variables associated with diabetes: test, triceps, & insulin
ExploreDiabetes <- lm(log(diabetes) ~ test.dum + pregnant + glucose + diastolic + triceps + insulin + bmi + age, pimaDum)
summary(ExploreDiabetes) %>% pander
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -0.9295 | 0.2424 | -3.834 | 0.0001473 |
| test.dum1 | 0.2436 | 0.0805 | 3.025 | 0.00265 |
| pregnant | -0.01207 | 0.01332 | -0.9064 | 0.3653 |
| glucose | -0.0003842 | 0.001398 | -0.2748 | 0.7836 |
| diastolic | -0.005291 | 0.00274 | -1.931 | 0.05419 |
| triceps | 0.001903 | 0.004006 | 0.4751 | 0.635 |
| insulin | 0.0001546 | 0.0003251 | 0.4755 | 0.6347 |
| bmi | 0.006663 | 0.00624 | 1.068 | 0.2864 |
| age | 0.005521 | 0.004451 | 1.24 | 0.2156 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 392 | 0.6127 | 0.06126 | 0.04165 |
#4 Variables associated with test: diabetes, pregnant, glucose, diastolic, & bmi
ExploreTest <- lm(test ~ diabetes + pregnant + glucose + diastolic + triceps + insulin + bmi + age, pima)
summary(ExploreTest) %>% pander
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -1.103 | 0.1436 | -7.681 | 1.337e-13 |
| diabetes | 0.1572 | 0.05804 | 2.708 | 0.007066 |
| pregnant | 0.01295 | 0.008364 | 1.549 | 0.1223 |
| glucose | 0.006409 | 0.0008159 | 7.855 | 4.071e-14 |
| diastolic | 5.465e-05 | 0.00173 | 0.03158 | 0.9748 |
| triceps | 0.001678 | 0.002522 | 0.6652 | 0.5063 |
| insulin | -0.0001233 | 0.0002045 | -0.6031 | 0.5468 |
| bmi | 0.009325 | 0.003901 | 2.391 | 0.0173 |
| age | 0.005878 | 0.002787 | 2.109 | 0.03559 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 392 | 0.3853 | 0.3458 | 0.3321 |
Question 2: 1. A small sample is a problem for ANOVA. The normality assumption is not as robust as the other assumptions here: there aren't enough points to assess normality well, and outliers can strongly distort a sample this small. 2. ANOVA. Null hypothesis: there is no difference in mean coagulation time between diet types. Alternative hypothesis: at least one diet type's mean coagulation time differs from the others. Significance will be set to 0.05, the standard alpha level.
ANOVA results: there is a statistically significant difference between means; reject the null. P-value = 4.66e-05, F = 13.57. Bonferroni: D is significantly different from B & C, and A is significantly different from B & C.
Conclusion: coagulation times differ between diets. The diets that differ are D from B & C, and A from B & C.
#1
#EDA
#Preliminary EDA
dim(coag)
## [1] 24 2
str(coag)
## 'data.frame': 24 obs. of 2 variables:
## $ coag: int 62 60 63 59 63 67 71 64 65 66 ...
## $ diet: chr "A" "A" "A" "A" ...
summary(coag)
## coag diet
## Min. :56.00 Length:24
## 1st Qu.:61.75 Class :character
## Median :63.50 Mode :character
## Mean :64.00
## 3rd Qu.:67.00
## Max. :71.00
summary(coag$coag)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 56.00 61.75 63.50 64.00 67.00 71.00
summary(factor(coag$diet))
## A B C D
## 4 6 6 8
coag %>%
group_by(diet) %>%
summarise(n = n(), mean = mean(coag), sd = sd(coag),
median = median(coag), IQR = IQR(coag)) %>% pander
| diet | n | mean | sd | median | IQR |
|---|---|---|---|---|---|
| A | 4 | 61 | 1.826 | 61 | 2.5 |
| B | 6 | 66 | 2.828 | 65.5 | 2.5 |
| C | 6 | 68 | 1.673 | 68 | 0.75 |
| D | 8 | 61 | 2.619 | 61.5 | 3.25 |
#Plots
ggplot(coag, aes(diet, fill= diet))+
geom_bar()
ggplot(coag, aes(coag))+
geom_boxplot()+
geom_vline(aes(xintercept = mean(coag)), color = "green")
ggplot(coag, aes(x=coag, fill = diet)) +
geom_boxplot()+
facet_grid(diet ~ .)+
theme(legend.position = "none")
#Assumptions
#1.Independent observations = Assumed in the experiment's design
#2.Homogeneity = There is equal variance
leveneTest(coag$coag, factor(coag$diet)) #Equal variance
#3 Normality (especially because the sample size is small) = Not Skewed, but platykurtic. Not enough samples to really make an ideal normal distribution.
shapiro.test(coag$coag)#normal bc p-value is not significant
##
## Shapiro-Wilk normality test
##
## data: coag$coag
## W = 0.97759, p-value = 0.8476
qplot(data=coag, sample= coag, color=diet)#points are neither linear nor lined up
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ggplot(coag, aes(coag))+
geom_histogram(binwidth=.9)# Not enough points to be normal.
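Because the groups are small (n = 4-8), a per-group normality check may be more informative than the pooled Shapiro-Wilk test above; a sketch, with the caveat that such small groups give the test little power:

```r
library(tidyverse)
# Shapiro-Wilk within each diet group; non-significant p-values
# are weak evidence of normality at these sample sizes.
coag %>%
  group_by(diet) %>%
  summarise(shapiro.p = shapiro.test(coag)$p.value)
```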
#4 No extreme outliers
ggplot(coag, aes(coag))+
geom_boxplot()+
facet_wrap("diet")#No extreme outliers. Can't remove data with such a small sample size anyways.
#2
#Run ANOVA test:
anova <- aov(coag ~ diet, data = coag)
summary(anova) %>% pander
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| diet | 3 | 228 | 76 | 13.57 | 4.658e-05 |
| Residuals | 20 | 112 | 5.6 | NA | NA |
#POST-TEST
bonferroni<- pairwise.t.test(coag$coag, coag$diet, p.adjust.method = "bonferroni") #Bonferroni over Tukey because it's more conservative
bonferroni #D is significantly different from B & C; A is significantly different from B & C.
##
## Pairwise comparisons using t tests with pooled SD
##
## data: coag$coag and coag$diet
##
## A B C
## B 0.02282 - -
## C 0.00108 0.95266 -
## D 1.00000 0.00518 0.00014
##
## P value adjustment method: bonferroni
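For comparison with the Bonferroni correction, Tukey's HSD (designed specifically for all pairwise comparisons after aov) can be run on the same fit; a sketch using the anova object from above:

```r
# TukeyHSD gives adjusted p-values plus confidence intervals
# for every pairwise diet difference from the aov fit.
TukeyHSD(anova)
plot(TukeyHSD(anova))  # intervals crossing zero are non-significant
```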
Question 3: 1. NA/Noted 2. Multiple regression. Null hypothesis: there is no relationship between lpsa and the remaining variables. Alternative hypothesis: there is a relationship between lpsa and the remaining variables. Significance will be set to 0.05, the standard alpha level.
Results: model F-test p-value < 2.2e-16 (the 0.255882 value is the intercept's p-value, not the model's). There is a statistically significant association between lpsa and lcavol, lweight, svi, age, and lbph collectively; reject the null. Adjusted R-squared = 0.6245, so the model explains 62.45% of the variability. Significant coefficients: lcavol, lweight, svi. Residual standard error = 0.7073; F-statistic = 32.94.
Conclusion: the overall model is significant, and within it lcavol, lweight, and svi drive the relationship with lpsa among the variables we narrowed down to be the most influential.
#2
#Run the multiple regression first so the assumption functions below make sense
multiReg <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45, data = prostate)
summary(multiReg)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp +
## gleason + pgg45, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7331 -0.3713 -0.0170 0.4141 1.6381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.669337 1.296387 0.516 0.60693
## lcavol 0.587022 0.087920 6.677 2.11e-09 ***
## lweight 0.454467 0.170012 2.673 0.00896 **
## age -0.019637 0.011173 -1.758 0.08229 .
## lbph 0.107054 0.058449 1.832 0.07040 .
## svi 0.766157 0.244309 3.136 0.00233 **
## lcp -0.105474 0.091013 -1.159 0.24964
## gleason 0.045142 0.157465 0.287 0.77503
## pgg45 0.004525 0.004421 1.024 0.30886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7084 on 88 degrees of freedom
## Multiple R-squared: 0.6548, Adjusted R-squared: 0.6234
## F-statistic: 20.86 on 8 and 88 DF, p-value: < 2.2e-16
#EDA
#Preliminary EDA
dim(prostate)
## [1] 97 9
str(prostate)
## 'data.frame': 97 obs. of 9 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
summary(prostate)
## lcavol lweight age lbph
## Min. :-1.3471 Min. :2.375 Min. :41.00 Min. :-1.3863
## 1st Qu.: 0.5128 1st Qu.:3.376 1st Qu.:60.00 1st Qu.:-1.3863
## Median : 1.4469 Median :3.623 Median :65.00 Median : 0.3001
## Mean : 1.3500 Mean :3.653 Mean :63.87 Mean : 0.1004
## 3rd Qu.: 2.1270 3rd Qu.:3.878 3rd Qu.:68.00 3rd Qu.: 1.5581
## Max. : 3.8210 Max. :6.108 Max. :79.00 Max. : 2.3263
## svi lcp gleason pgg45
## Min. :0.0000 Min. :-1.3863 Min. :6.000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:-1.3863 1st Qu.:6.000 1st Qu.: 0.00
## Median :0.0000 Median :-0.7985 Median :7.000 Median : 15.00
## Mean :0.2165 Mean :-0.1794 Mean :6.753 Mean : 24.38
## 3rd Qu.:0.0000 3rd Qu.: 1.1786 3rd Qu.:7.000 3rd Qu.: 40.00
## Max. :1.0000 Max. : 2.9042 Max. :9.000 Max. :100.00
## lpsa
## Min. :-0.4308
## 1st Qu.: 1.7317
## Median : 2.5915
## Mean : 2.4784
## 3rd Qu.: 3.0564
## Max. : 5.5829
#Plots
ggplot(prostate, aes(lpsa))+
geom_histogram()+
geom_vline(aes(xintercept = mean(lpsa)), color = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Assumptions
#1.Independent observations = Assumed from the experiment's design; a Durbin-Watson test could be applied.
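The Durbin-Watson test mentioned above is available in the already-loaded car package; a sketch applying it to the fitted model:

```r
# Tests for first-order autocorrelation in the residuals;
# a statistic near 2 with a non-significant p-value supports independence.
durbinWatsonTest(multiReg)
```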
#2.Linearity = Not linear
par(mfrow=c(2,2))
plot(multiReg) #Residual vs Fitted plot is not linear
crPlots(multiReg) # Most of the variables' residuals are not linear
#Maybe transforming lcavol will help. (Even though I see the log transformation has been applied to the relevant variables.)
prostate1 <- prostate %>% mutate(lcavol_c = scale(lcavol, scale = FALSE))
multiReg1 <- lm(lpsa ~ lcavol_c+ I(lcavol_c^2) + lweight + age + lbph + svi + lcp + gleason + pgg45, data = prostate1)
summary(multiReg1)
##
## Call:
## lm(formula = lpsa ~ lcavol_c + I(lcavol_c^2) + lweight + age +
## lbph + svi + lcp + gleason + pgg45, data = prostate1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.73429 -0.37817 -0.01007 0.37917 1.63319
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.429977 1.338502 1.068 0.28832
## lcavol_c 0.594688 0.094361 6.302 1.17e-08 ***
## I(lcavol_c^2) 0.011900 0.051249 0.232 0.81693
## lweight 0.448784 0.172677 2.599 0.01098 *
## age -0.019295 0.011330 -1.703 0.09213 .
## lbph 0.110603 0.060721 1.821 0.07197 .
## svi 0.754191 0.250981 3.005 0.00347 **
## lcp -0.109584 0.093203 -1.176 0.24290
## gleason 0.047276 0.158584 0.298 0.76633
## pgg45 0.004585 0.004453 1.030 0.30596
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7123 on 87 degrees of freedom
## Multiple R-squared: 0.655, Adjusted R-squared: 0.6193
## F-statistic: 18.35 on 9 and 87 DF, p-value: < 2.2e-16
crPlots(multiReg1)#It barely changed, so I will not be using this transformation going forward, but I wanted to check.
#3 Homoscedasticity/ constant variability = Not homoscedastic.
par(mfrow=c(2,2))
plot(multiReg)#See plot(multiReg)'s Scale-Location plot. The line is not horizontal enough to be homoscedastic. Since a log transformation has already been performed it won't get much straighter than this.
#4 Normality of Residuals = Residuals are not the most normal. It could be worse, but it is not great.
#See plot(multiReg)'s Normal Q-Q plot OR the following:
pros.data <- augment(multiReg)# get the residuals
ggplot(pros.data, aes(sample = .resid)) +
geom_qq() +
stat_qq_line(color = "green") #Doesn't follow a straight line, not normal, but close
ggplot(pros.data, aes(x = .resid)) +
geom_histogram(bins = 35) #Normal enough, maybe a little platykurtic
(length(boxplot(prostate$lpsa, plot=F)$out)/97)*100 #I could remove outliers here, but I've decided against it since there are so many of them (4.1% of data are outliers). That is too much to cut out
## [1] 4.123711
#5 No multicollinearity = Multicollinearity is present in a number of variables.
ggpairs(data=prostate)
ggcorr(prostate)#gleason & pgg45 are highly correlated and that makes sense. Pgg45 is based on gleason scores. lpsa & lcavol, lcp&svi, lcp&lcavol, lcp&pgg45, lpsa&svi, and lcp&gleason are all relatively correlated, so multicollinearity is present.
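Variance inflation factors from car can put numbers on the multicollinearity seen in the correlation plots; a sketch on the full model:

```r
# Rule of thumb: VIF > 5 (or 10) flags problematic collinearity;
# gleason and pgg45 are the likely candidates given ggcorr above.
vif(multiReg)
```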
#Which model best fits
#by significance = lcavol, lweight, svi, with age and bph's p-val= 0.1
#by backward BIC = lcavol, lweight, svi
multiReg_back <- lm(lpsa ~ .,prostate)
n <- nrow(prostate)
back.bic <- step(multiReg_back, direction = "backward", k= log(n), trace = 0)
back.bic
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)
##
## Coefficients:
## (Intercept) lcavol lweight svi
## -0.2681 0.5516 0.5085 0.6662
#by backward AIC = lcavol, lweight, svi, age, lbph,
multiReg_back <- lm(lpsa ~ .,prostate)
back.aic <- step(multiReg_back, direction = "backward", trace = 0)
back.aic
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph svi
## 0.95100 0.56561 0.42369 -0.01489 0.11184 0.72095
#Best model showdown!
multiReg_best1 <- lm(lpsa ~ lcavol + lweight + svi + age + lbph, prostate)
summary(multiReg_best1)#Adjusted r^2 = 0.6245, #The winner
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi + age + lbph, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83505 -0.39396 0.00414 0.46336 1.57888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.95100 0.83175 1.143 0.255882
## lcavol 0.56561 0.07459 7.583 2.77e-11 ***
## lweight 0.42369 0.16687 2.539 0.012814 *
## svi 0.72095 0.20902 3.449 0.000854 ***
## age -0.01489 0.01075 -1.385 0.169528
## lbph 0.11184 0.05805 1.927 0.057160 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7073 on 91 degrees of freedom
## Multiple R-squared: 0.6441, Adjusted R-squared: 0.6245
## F-statistic: 32.94 on 5 and 91 DF, p-value: < 2.2e-16
multiReg_best2 <- lm(lpsa ~ lcavol + lweight + svi, prostate)
summary(multiReg_best2)#Adjusted r^2 = 0.6144
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72964 -0.45764 0.02812 0.46403 1.57013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.26809 0.54350 -0.493 0.62298
## lcavol 0.55164 0.07467 7.388 6.3e-11 ***
## lweight 0.50854 0.15017 3.386 0.00104 **
## svi 0.66616 0.20978 3.176 0.00203 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7168 on 93 degrees of freedom
## Multiple R-squared: 0.6264, Adjusted R-squared: 0.6144
## F-statistic: 51.99 on 3 and 93 DF, p-value: < 2.2e-16
#The R^2 for the first model is higher, so its predictors explain more of the variance than multiReg_best2's. Also, the researchers want more of a predictive model than an explanatory one, so keeping age and lbph is the better choice.
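Since multiReg_best2 is nested in multiReg_best1, a partial F-test can check whether age and lbph jointly improve the fit, rather than relying on adjusted R-squared alone; a sketch:

```r
# anova() on nested lm fits tests the joint contribution
# of the extra terms (age, lbph) in the larger model.
anova(multiReg_best2, multiReg_best1)
```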
Question 4: 1. Noted 2. Regression with an interaction term. Null hypothesis: region has no effect on infant mortality rates once income per capita is accounted for. Alternative hypothesis: region has an effect on infant mortality rates even with income per capita accounted for. Significance will be set to 0.10.
Results: model F-test p-value = 5.055e-05 (the 4.55e-13 value is the intercept's p-value). Region has an effect on infant mortality rates even with income per capita accounted for; reject the null. Adjusted R-squared = 0.227, so the model explains 22.7% of the variability. Significant coefficients at alpha = 0.10: the Americas and Europe region terms and the Americas-by-income interaction. Residual standard error = 79.83; F-statistic = 5.194.
Conclusion: region has an effect on infant mortality rate. With Africa as the reference level, the Americas and Europe show lower mortality than Africa, and for the Americas this is especially true once the income interaction is accounted for.
Comparing the two models, adding income and its interaction barely changes the adjusted R-squared (0.227 vs. 0.233 for region alone), so region carries most of the explained variance.
#1
#Run the regression first so the assumption functions below make sense
Reg <- lm(mortality ~ region, data = infmort)
summary(Reg)
##
## Call:
## lm(formula = mortality ~ region, data = infmort)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.29 -37.52 -5.76 16.91 553.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 142.29 13.64 10.431 < 2e-16 ***
## regionAmericas -87.17 21.76 -4.005 0.000122 ***
## regionAsia -46.12 20.50 -2.249 0.026755 *
## regionEurope -123.04 23.19 -5.306 7.07e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79.54 on 97 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.2556, Adjusted R-squared: 0.2326
## F-statistic: 11.1 on 3 and 97 DF, p-value: 2.494e-06
#EDA
#Preliminary EDA
dim(infmort)
## [1] 105 4
str(infmort)
## 'data.frame': 105 obs. of 4 variables:
## $ region : chr "Asia" "Europe" "Europe" "Americas" ...
## $ income : int 3426 3350 3346 4751 5029 3312 3403 5040 2009 2298 ...
## $ mortality: num 26.7 23.7 17 16.8 13.5 10.1 12.9 20.4 17.8 25.7 ...
## $ oil : chr "no oil exports" "no oil exports" "no oil exports" "no oil exports" ...
sum(is.na(infmort))#4 NAs in mortality
## [1] 4
infmort <- na.omit(infmort)
summary(infmort)
## region income mortality oil
## Length:101 Min. : 50 Min. : 9.60 Length:101
## Class :character 1st Qu.: 130 1st Qu.: 26.20 Class :character
## Mode :character Median : 334 Median : 60.60 Mode :character
## Mean :1022 Mean : 89.05
## 3rd Qu.:1191 3rd Qu.:129.40
## Max. :5596 Max. :650.00
infmort %>%
group_by(region) %>%
summarise(mean = mean(mortality)) %>% pander
| region | mean |
|---|---|
| Africa | 142.3 |
| Americas | 55.12 |
| Asia | 96.17 |
| Europe | 19.26 |
levels(factor(infmort$region))
## [1] "Africa" "Americas" "Asia" "Europe"
#Plots
ggplot(infmort, aes(mortality, fill= region))+
geom_boxplot()+
facet_grid(region ~ .)+
theme(legend.position = "none")
ggplot(infmort, aes(x = income, y = mortality, color = region))+
geom_point(alpha = 0.7)
#Assumptions
#1.Independence of residuals = Assumed from the experiment's design.
#2.Linearity = Not linear.
par(mfrow=c(2,2))
plot(Reg)
crPlots(Reg) #Residuals vs Fitted plot does not look straight.
#3 Homoscedasticity/constant variability of residuals = Not homoscedastic.
par(mfrow=c(2,2))
plot(Reg)#Scale-Location plot is not straight.
#4 Normality of Residuals = Residuals are not normal. Skewed to the right & kurtotic.
#See plot(Reg)'s Normal Q-Q plot OR the following:
infmort.data <- augment(Reg)# get the residuals
ggplot(infmort.data, aes(sample = .resid)) +
geom_qq() +
stat_qq_line(color = "green") # The outliers are way off the line
ggplot(infmort.data, aes(x = .resid)) +
geom_histogram(bins = 25)# Now this shows it isn't normal. Right skewed & kurtotic
(length(boxplot(infmort$mortality, plot=F)$out)/101)*100 #I could remove outliers here, but I've decided against it since 3% of the data are outliers. That seems like a lot to cut out.
## [1] 2.970297
#Dummy variable
infmortDum <- mutate(infmort, region.dum = factor(region))
Reg_Dum <- lm(mortality~region.dum, infmortDum)
summary(Reg_Dum)
##
## Call:
## lm(formula = mortality ~ region.dum, data = infmortDum)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.29 -37.52 -5.76 16.91 553.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 142.29 13.64 10.431 < 2e-16 ***
## region.dumAmericas -87.17 21.76 -4.005 0.000122 ***
## region.dumAsia -46.12 20.50 -2.249 0.026755 *
## region.dumEurope -123.04 23.19 -5.306 7.07e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79.54 on 97 degrees of freedom
## Multiple R-squared: 0.2556, Adjusted R-squared: 0.2326
## F-statistic: 11.1 on 3 and 97 DF, p-value: 2.494e-06
#2 Answer
interaction <- lm(mortality ~ region.dum * income, data = infmortDum)
summary(interaction)
##
## Call:
## lm(formula = mortality ~ region.dum * income, data = infmortDum)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.04 -31.36 -2.58 16.25 559.77
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130.91725 15.55534 8.416 4.55e-13 ***
## region.dumAmericas -66.30647 26.15577 -2.535 0.0129 *
## region.dumAsia -30.28530 24.14769 -1.254 0.2129
## region.dumEurope -96.26275 47.63292 -2.021 0.0462 *
## income 0.04163 0.02702 1.541 0.1268
## region.dumAmericas:income -0.05133 0.02982 -1.721 0.0886 .
## region.dumAsia:income -0.04842 0.03121 -1.552 0.1242
## region.dumEurope:income -0.04669 0.03018 -1.547 0.1253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79.83 on 93 degrees of freedom
## Multiple R-squared: 0.2811, Adjusted R-squared: 0.227
## F-statistic: 5.194 on 7 and 93 DF, p-value: 5.055e-05
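Whether income and its interaction with region add anything beyond region alone can also be tested formally with a nested-model comparison; a sketch using the two fits above:

```r
# Joint F-test for income and all region:income interaction terms
# (Reg_Dum is nested inside the interaction model on the same data).
anova(Reg_Dum, interaction)
```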
#Tried log to make it more normal, but it didn't do much
interaction.log <- lm(log(mortality) ~ region.dum * income, data = infmortDum)
infmort.data2 <- augment(interaction.log)
ggplot(infmort.data2, aes(sample = .std.resid)) +
geom_qq() +
stat_qq_line(color = "green")